Monitoring insect biodiversity and comparison of sampling strategies using metabarcoding: A case study in the Yanshan Mountains, China

Abstract Insects are the richest and most diverse group of animals and yet there remains a lack, not only of systematic research into their distribution across some key regions of the planet, but of standardized sampling strategies for their study. The Yanshan Mountains, being the boundary range between the Inner Mongolian Plateau and the North China Plain, present an indispensable piece of the insect biodiversity puzzle: both requiring systematic study and offering opportunities for the development of standardized methodologies. This is the first use of DNA metabarcoding to survey the insect biodiversity of the Yanshan Mountains. The study focuses on differences of community composition among samples collected via different methods and from different habitat types. In total, 74 bulk samples were collected from five habitat types (scrubland, woodland, wetland, farmland and grassland) using three collection methods (sweep netting, Malaise traps and light traps). After DNA extraction, PCR amplification, sequencing and diversity analysis were performed, a total of 7427 Operational Taxonomic Units (OTUs) at ≥97% sequence similarity level were delimited, of which 7083 OTUs were identified as belonging to Insecta. Orthoptera, Diptera, Coleoptera and Hemiptera were found to be the dominant orders according to community composition analysis. Nonmetric multidimensional scaling (NMDS) analysis based on Bray–Curtis distances revealed highly divergent estimates of insect community composition among samples differentiated by the collection method (R = .524802, p = .001), but nonsignificant difference among samples differentiated according to habitat (R = .051102, p = .078). The study therefore appears to indicate that the concurrent use of varied collection methods is essential to the accurate monitoring of insect biodiversity.

Because of the anthropogenic pressures, up to 1 million fauna species will be facing rapid extinction within decades (Tollefson, 2019).
It is becoming increasingly urgent for us to have effective systems with which to observe and assess changes in biodiversity (Bohmann et al., 2014;Clare et al., 2022).
Historically, biodiversity monitoring has relied on the identification of species and counting of individuals by taxonomists, requiring substantial expertise, time, funding and specialization (Basset et al., 2012;Campbell et al., 2011;Wijerathna & Gunathilaka, 2020).
When there is shortage of any of these, conservation scholars and environmental managers make decisions based on limited information. Given the pace of biodiversity loss, there is a need for methods of inventory that can be applied quickly and efficiently in diverse contexts, including those about which prior information is limited.
Nearly 20 years ago, the initiative of DNA barcoding offered an opportunity to identify specimens for researchers beyond the taxonomic expertise through getting a standard sequence of genes from a single sample and then searching against a reference library (Ekrem et al., 2007;Hajibabaei et al., 2007;Hebert et al., 2003Hebert et al., , 2004. Compared with the initial approach of DNA barcoding, DNA metabarcoding makes the progress less costly but more rapidly (Braukmann et al., 2019;Valentini et al., 2016), which features it feasible to analyze the composition and structure of entire insect communities, and become a widely accepted and successfully applied in the studies of community ecology (Bittleston et al., 2022;Keck et al., 2022;Marquina et al., 2019;Piper et al., 2019;van Zinnicq Bergmann et al., 2021). The mitochondrial cytochrome c oxidase subunit I (COI) has the advantages of conservative, faster evolution rate and the large amount of insect taxonomic information available for this gene in the online repositories (Nehal et al., 2021;Ren et al., 2011). Therefore, COI locus has been the most widely used marker for metabarcoding of insects (Zhang & Hewltl, 1996).
Another historic issue facing the accurate study of insect biodiversity is the way in which methods of collection may have biases towards certain taxa. Insects samples are collected via various methods, chiefly: sweep netting, soil extraction, leaf litter collection, canopy fogging and traps (McGavin, 1997). Sweep netting is a common net type used in insect sampling which is used to capture insects flying or hovering over foliage. Malaise traps are an easily-handled, low-cost sampling tool which can capture flying and also wingless insects day and night. Light traps have been widely used in nocturnal insect sampling. And with robust sampling devices, light traps can collect high numbers of specimens (Yi et al., 2012). Previous studies into insect biodiversity have predominantly used one or two collection methods (Hausmann et al., 2020;Hwang et al., 2022;Kaczmarek et al., 2022;Wirta et al., 2016), and have rarely made use of three or more in combination. Different sampling methods are generally used to collect different kinds of insect taxa, with methods ordinarily chosen in accordance with the active behavior of the target (nocturnal, diurnal, low-flying, aquatic, etc.) (McGavin, 1997;Yi et al., 2012).
Two methods deployed in the same location may yield very different samples. If a sample of an area is to provide representative data on community composition, then it is key to understand how different methods of collecting will capture, miss, or otherwise proportionally affect the various taxa contained within that community.
The Yanshan Mountains are a major mountain range located to the north of the North China Plain and the south of the Inner Mongolian Plateau, containing a large number of habitats and diverse species.
Positioned within one of the three largest urban agglomerations in China, containing 10 cities and with a total area of 21.7 million ha and a population of 110 million people (Cui et al., 2019;Li et al., 2020), the region is facing increasing anthropogenic pressures. Several studies have been conducted on the biodiversity of the region, including of bacteria (Tang et al., 2016), fungi (Zhou et al., 2022), flora (Bu et al., 2014;Tang et al., 2017) and large fauna (Xie et al., 2022) and arachnids (Lu et al., 2022); two have focused on insects: one on the relationship between the diversity of flower-visiting insects and habitat types (Han et al., 2022) and another on moth communities (Hao et al., 2020).
In this study, we collected samples via three collection methods (sweep netting, Malaise traps and light traps) from five different habitats experiencing different degrees of anthropogenic impact (farmland, grassland, scrubland, wetland and woodland) and performed DNA metabarcoding. The objectives of the study were (a) to establish the insect community composition of the Yanshan Mountains through DNA metabarcoding, (b) to compare alpha diversity and beta diversity among the three collection methods and evaluate the effect of different methods on the collecting of different insect groups, and (c) to compare alpha diversity and beta diversity among different habitats and identify the dominant insect groups in the five different habitats.

| Sampling
We divided the Yanshan Mountains into grids of 10 km 2 and selected 30 for field sampling. To ensure inclusion of different degrees and types of anthropogenic impact, grids were categorized into habitat types (farmland, grassland, scrubland, wetland and woodland) and each type was sampled. The samples were collected via three methods: sweep netting, Malaise traps and light traps. Sweep netting and

T A X O N O M Y C L A S S I F I C A T I O N
Biodiversity ecology, Community ecology, Conservation ecology, Entomology, Taxonomy, Zoology Malaise traps were set once per grid, and light traps once every two grids, due to inconvenient electricity provision and the steep terrain.
All samples were collected between July and August 2019. In total, 74 samples (29 from sweep netting, 30 from Malaise traps and 15 from light traps) were obtained for the study (Figure 1; Table S1).
For the sweep netting, two sampling lines were set, each of a length of at least 200 m, and 100 sweeps performed per line. Malaise traps were left open for one each week, equipped with a jar of 99% ethanol. Light traps were run from 7:30 p.m. to 9:00 p.m. The samples were stored in 99% ethanol, except Lepidoptera specimens which were stored in triangular bags. Preservative ethanol was replaced every 24 h during collection trips (three times) and then a final time at the laboratory. The samples were then stored in a freezer at −20°C for later use.

| Sample preparation and DNA extraction
Raw samples from each site were prepared individually by pouring the contents of the collection bottles into sterile petri dishes, sorting the insect specimens from other organisms, and then transferring them into tubes until the residual ethanol evaporated completely.
To avoid the contamination from the gut content and the scales from wings, in the case of lepidopteran, specimens' wings and abdomens were removed before extraction. Specimens gathered from the same grid via the same collection method were then pooled together and treated as a single bulk sample. Each bulk sample was then ground with a sterilized pestle to homogenize the tissue. To reduce systematic errors, triplicate samples for DNA extraction were aliquoted from each bulk sample. Genomic DNA was extracted with an E-Z 96® Mag-Bind Soil DNA Kit in accordance with the manufacturer's protocol, and the mass and volume for each DNA extraction were checked (Table S2). A negative (no DNA template) control was processed with each extraction batch to monitor contamination.

| Bioinformatics and data processing
Sequence processing was performed with cutadapt v2.3 (Martin, 2011) and vsearch v2.13.4 suite (including fastq_mergepairs, fastq_filter, derep_fulllength, cluster_size and uchime_denovo) (Rognes et al., 2016). Initial quality control was carried out with fastQC. Cutadapt v2.3 was used to trim primers and remove sequences with unmatched primers. Pair-merging was performed using fastq_mergepairs with 202 bp overlap. Quality filtering was F I G U R E 1 Map of sampling sites in the Yanshan Mountains. Triangles of different colors correspond to samples collected using the three different collection methods (n = 74, red: sweep netting; yellow: Malaise traps; black: light traps). performed using fastq_filter, with a maximum expected error of 0.5. Derep_fulllength was used for dereplicating. Uchime_denovo was used to further remove the chimeras. High-quality sequences were then clustered into operational taxonomic units (OTUs) using cluster_size with a sequence identity threshold of 97% (Creedy et al., 2019;Meier et al., 2006). Singletons of all specimens were removed, and the complete OTU table generated.

| Statistical analysis
For data obtained from NCBI contained more unique genus and species labels than BOLD arthropod references (Robeson et al., 2021), the representative sequences of each OTU were blasted against the nt database (ftp://ftp.ncbi.nih.gov/blast/ db/) with "blastn" (Camacho et al., 2009). The complete OTU table was then analyzed using QIIME2 (Bolyen et al., 2019) to reveal the community composition both of each individual sample and of the Yanshan Mountains community taken as a whole. The OTUs identified as not insects were then removed. Rarefication of all samples was performed to check the sequencing depth (Heck et al., 1975). Using QIIME2, all samples were rarefied with a threshold set to 95% of the number of sequences of the sample with the lowest count (19,117). The insect OTU table was then generated and the rarefaction curves were obtained.
Venn diagram and biodiversity (alpha and beta) analyses based on the insect OTU table were carried out in R v3.6 (R Development Core Team, 2017). The Venn diagrams were generated in VennDiagram (Chen & Boutros, 2011). The alpha diversity indices (Chao1, Good's coverage, Pielou's evenness, Shannon and Simpson) were estimated with ggplot2 (Ginestet, 2011). Chao1 is used for estimating the number of species actually present in the insect community. Good's coverage for the proportion of non-singleton OTUs, Pielou's evenness for the evenness, and both Shannon and Simpson for the diversity.
The results of the five alpha diversity indices were subjected to Kruskal-Wallis tests (one for each of the collection-method groups) (Theodorsson-Norheim, 1986) and Dunn's tests (one for each pairwise comparison) (Dinno, 2017). Differences in OTU composition among collection methods and habitat types were examined using non-metric multidimensional scaling (NMDS) based on Bray-Curtis and Jaccard dissimilarity matrices in vegan (Dixon, 2003); analysis of similarities (ANOSIM) tests were also applied using 999 permutations. Heatmaps were produced using pheatmap (Kolde, 2019).

| Order level
Apart from 74 OTUs that were not identified at the order level, all insect OTUs were classified into 16 orders. Order level community composition diagrams of each sample and for the whole Yanshan LT14; Hemiptera dominated S11, S13, S15, S17, S25, S28 and MT14 ( Figure 2b).
There was no significant correspondence between species richness and abundance, with OTUs obtained as follows: Acrididae

| Alpha diversity analyses
After rarefying, 6447 OTUs were obtained (Table S5) (Table S6): The Chao1 index showed the highest richness of species to be in S7, S3 and S1 and the lowest in S16, LT21 and LT29; Good's coverage showed that high coverage (>98%) was achieved in samples; Pielou's evenness showed S7 to have the greatest evenness and LT6 to have the least; Simpson and Shannon diversity indices showed the insect biodiversity to be highest in S7 and lowest in LT6.  Orthoptera Diptera Coleoptera Hemiptera Lepidoptera Hymenoptera Mantodea Ephemeroptera Odonata Blattodea Others S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S 9 S 1 0 S 1 1 S 1 2 S 1 3 S 1 4 S 1 5 S 1 6 T 2 8 M T 2 9 M T 3 0 L T 1 L T 3 L T 5 L T 7 L T 9 L T 1 1 L T 1 3 L T 1 5 L T 1 9 L T 2 0 L T 2 1 L T 2 3 L T 2 5 L T 2 7 L T 2 9 Heat maps of all the OTUs that could be identified to order or family level are displayed in their collection-method groups in Figure S6. woodland 25, scrubland 21, wetland 7, farmland 6, grassland 0 (Table S8).

| Group comparison analyses
The results of the five alpha diversity indices, Kruskal-Wallis tests, and Dunn's tests are displayed in Figure S7. with the exception of Simpson, which showed nonsignificant F I G U R E 3 Alpha diversity estimates of the three collection-method groups. Box plots display the first and third quartiles and the median, maximum and minimum observed values within each dataset. The number under each diversity index label is the p-value of the Kruskal-Wallis test. The short line represents the significant difference between different collection-method groups calculated using Dunn's test. * represents the degree of difference: *p < .05, **p < .01, ***p < .001.

F I G U R E 4 NMDS analysis of insect community similarity among
the different collection-method groups based on a Bray-Curtis dissimilarity matrix.    (Table S9).
Heat maps of all the OTUs that could be identified to order or genus level are displayed according to their habitat types in Figure 6.
Large variations exist in the species detected across all the samples. At order level, high abundance was identified in habitat types as follows: Blattodea and Dermaptera in scrubland; Neuroptera, Lepidoptera, Psocoptera, Trichoptera and Thysanoptera in woodland; Orthoptera and Diptera in wetwood; Coleoptera, Odonata,

| DISCUSS ION
We completed a project of large-scale bio-monitoring for the Yanshan Mountains, using a 313 bp fragment of the partial COI gene and Illumina high-throughput sequencing, illustrating how different collection methods may produce different characterizations of a single community. It is clear that the methods of collection had biases towards certain taxa in our study. If bio-monitoring is to be effective, then understanding how various collection methods may affect the composition of the resulting samples is key to the future study of insect biodiversity. Although metabarcoding is indeed a promising approach for the study of global biodiversity (Cristescu, 2014;Zhang et al., 2022), a comprehensive understanding of sampling methodology is central to its future accuracy.

| Samples preparation and PCR amplification
It is thought that the relative read abundance is influenced by methodological issues than by ecological issues (Deagle et al., 2019;Piñol et al., 2019). In a study of complex arthropod communities, Creedy et al. (2019) concluded that OTU recovery was not skewed by variations in the biomass of the constituent species, as long as sequencing depth was sufficient and specimens were within reasonable size ranges. However, they point out that all primers target specific taxa, and that therefor in any PCR-based study of community composition, choice of primer will have an inevitable influence on results: the biases introduced by choices of primer are therefore likely to affect the evaluation of species richness, but not of turnover. In future biodiversity studies, these limitations may be overcome to some degree by the use of, single or in combination: PCR-free library construction, single-molecule real-time sequencing, and strategies based on combining multiple sets of primers.
The ecological and biological issues might also introduce bias for the relative read abundance. In this study, the community composition results demonstrated that the dominant order, family and genus were Orthoptera, Tettigoniidae and Atlanticus, respectively. One reason for that may be Orthopterans occupied relative larger individual can result in more similar read numbers than many small ones. Sorting bulk samples by specimen biomass or body size could improve taxa detection using DNA metabarcoding (Elbrecht et al., 2017). However, we chose the current method as it makes less susceptible to contamination, and the purpose of the study was not to detect small and rare taxa in the bulk samples, which were also discussed by Elbrecht et al. and other researchers (Aylagas et al., 2016;Klunder et al., 2022).

| Molecular biodiversity assessment and community composition analysis
In our study, most samples reached saturation, indicating that the sequencing depth was appropriate for diversity analyses. Most A small number of OTUs (344, or 4.63%) were identified as noninsect, and included bacteria, fungi, himatismenida and oomycetes.
One possible interpretation could be that these OTUs were attached to or had been ingested by the insect specimens. It is also possible that the OTUs identified as being bacteria are actually errors that come from the blastn process, misidentified as a result of too-short fragments and insufficient database references. Recent studies have shown that the primers used in this study are capable of amplifying fungi DNA, even though they were primarily designed for metazoan (Leray & Knowlton, 2015;Zenker et al., 2020).
The results of community composition analyses at order level indicate a correlation between abundance and richness the rankings, although at the family and genus levels, no such correlation was noted. If the correlation at order-level between richness and abundance is a meaningful finding rather than an anomaly, it can probably only be illuminated through a further and more detailed analysis of the insect biodiversity of the Yanshan Mountains.

| Collection of samples collected via different collection methods
In our study, species richness was the highest in the samples obtained via sweep netting, followed by Malaise traps and light traps respectively. Reasons for differences in species richness between collection methods can be hypothesized, but as with some other aspects of this initial study, only further research can form the basis for anything more solid than speculation. One possible explanation may be that suggested by Yi et al. (2012), which is that methods of collection that minimize tailoring to specific behavior and attract-  (Nag & Nath, 2010;Thein & Choi, 2016).
These differences among the specimens obtained via alternative collection methods were underlined by our finding that differences in community composition among the three collection methods were both significant and greater than those within each group.
The use of all three collection methods resulted in obtaining of specimens that would not have been gathered using only two.

| Comparison of samples collected from different habitat types
Alpha diversity indices showed that species richness, community evenness and diversity were highest in the wetland habitats and lowest in grassland, indicating that the community structure is most stable in the former and least stable in the latter. The stability of insect community structure in farmland and grassland appears to both be relatively weak and therefore present cause for concern. Further investigation of the insect communities is therefore warranted for the purpose of formulating interventions to promote biodiversity in the habitat types.

| CON CLUS IONS
In this study, we established initial insect community compositions of the Yanshan Mountains and conducted comparison of three collection methods and five different habitats. Based on our results, we confirm that DNA metabarcoding provides a cost-and time-efficient way to recover information on the species composition of insect communities. However, the collection methods, crucial for final DNA capture, remain both unstandardized and incompletely understood. The differences we observed among samples collected using different methods indicate poorly-understood factors in the obtaining of bio-monitoring data. A more comprehensive understanding of sampling methodology is therefore central to the future accuracy and efficacy of metabarcoding, particularly in the field of biodiversity analysis.

ACK N OWLED G M ENTS
We thank Mr. Edward Hughes for his considerable contribution to the structuring and writing of this paper. We are grateful to the re-

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

O PE N R E S E A RCH BA D G E S
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://doi.org/10.5061/ dryad.x3ffb g7p7.